Hi, everyone. Welcome back to the Heterogeneous Parallel Programming class. We are at Lecture 1.7, kernel-based parallel programming, and today we will be discussing multidimensional kernel configuration.

The objective of this lecture is for you to understand the use of multidimensional thread grids. In particular, we are going to discuss the use of multidimensional block and thread indices, and how we can map these block and thread indices to data indices in order to process two-dimensional pictures and so on. After this lecture, you should be able to easily generalize all of these concepts to the processing of three-dimensional data structures using three-dimensional blocks and grids.

This picture was shown earlier in the introductory lecture, where we assumed that the host has launched kernel 1, and this generates a grid of threads. We assume the user has specified a two-dimensional grid, which is reflected in the two-dimensional indices of the blocks, and that each block is organized as a three-dimensional arrangement of threads, so each thread in this picture has three indices. The convention we will use is that for a two-dimensional structure we show the x and y indices: the first index is x and the second index is y. For a three-dimensional structure, we show the indices in the order x, y, and z, so the first index is x, the second is y, and the third is z.

That was a very simple example, and now we are ready to look at a more practical one. Here we see a two-dimensional picture of pixels: 62 pixels in the y dimension and 76 pixels in the x dimension. We will call this a 62 by 76 picture, which means that the first number in the configuration is the number of pixels in the y dimension and the second number is the number of pixels in the x dimension. We assume that we are going to use a 16 by 16 thread block, that is, 16 threads in the y dimension and 16 threads in the x dimension, to process the picture, just like in the vector addition example.
We need to make sure that we have enough threads in both the y dimension and the x dimension to process all the pixels. So we will need to launch four thread blocks in the y dimension to cover all 62 pixels: four thread blocks give us 64 threads, so all 62 pixels are covered, and we will have two extra threads in the y dimension that should not do anything. In the x dimension we will need to launch five thread blocks, which gives us 80 threads, so we cover all 76 pixels and we will have four extra threads that should not do anything. This will be reflected in the way we write the kernel.

Before we look at the kernel code, I'd like to quickly review the way C and C++ lay out a two-dimensional matrix, or two-dimensional array, in the linear address space. In modern computers, all memory data are stored in a linear address space. Even though conceptually we have a two-dimensional array in C, ultimately all the elements are stored in a linear order; this is called the layout, and C adopts the row-major layout convention. This means that all the elements in a row keep their relative positions: all the neighboring elements in row zero are placed consecutively in the linearized memory, and all the elements in row one are also placed in consecutive locations. So whenever we need to access an element using a linearized address, we can generate that address by multiplying the row index by the width of the array and adding the column index. For example, suppose we want to access element M(2,1); we have a row index of 2 and a column index of 1. We multiply the row index 2 by the width, which is 4, and then add the column index, which is 1, and this gives us 9. If you look at the top of the picture, we show the linearized index for all the M elements: the first four are 0, 1, 2, 3, the next ones are 4, 5, 6, 7, and so element 9 is indeed the original M(2,1).
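To make the calculation concrete, here is a minimal sketch of the row-major address computation just described. The macro name IDX and the variable names are illustrative, not from the lecture slides; the sketch simply encodes linear index = row times width plus column.

```c
#include <stdio.h>

/* Row-major linearization: element (row, col) of a `width`-column array. */
#define IDX(row, col, width) ((row) * (width) + (col))

int main(void) {
    int width  = 4;                   /* 4 elements per row, as in the example */
    int linear = IDX(2, 1, width);    /* element M(2,1): row 2, column 1 */
    printf("M(2,1) maps to linear index %d\n", linear);   /* prints 9 */
    return 0;
}
```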
This row-major linearization is what we will use for accessing all the dynamically allocated arrays in CUDA kernels. Even though this is familiar to all of you, it is very important to keep it in mind when you look at any CUDA kernel. For those of you who come from a Fortran background, Fortran adopts the column-major layout convention, where all the elements in a column have their relative positions preserved. In the C layout, by contrast, adjacent elements in a column are laid out in a strided way, so that they are a full row width away from each other, whereas in a Fortran program all the elements in the same column would be placed in adjacent locations.
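As a small, hedged sketch of that stride (the width and element positions are illustrative, not from the slides), the following shows that in C's row-major layout, neighbors within a row are one element apart while neighbors within a column are a full row width apart:

```c
#include <stdio.h>

int main(void) {
    const int Width = 4;   /* 4 elements per row, as in the lecture's example */

    /* Linear index of element (row, col) in C's row-major layout: row * Width + col. */
    int rowNeighborGap = (1 * Width + 2) - (1 * Width + 1);   /* M(1,2) vs M(1,1) -> 1     */
    int colNeighborGap = (2 * Width + 1) - (1 * Width + 1);   /* M(2,1) vs M(1,1) -> Width */

    printf("row neighbors:    %d element apart\n",  rowNeighborGap);
    printf("column neighbors: %d elements apart\n", colNeighborGap);
    return 0;
}
```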
Now that we understand how C and C++ linearize a two-dimensional array, we are ready to look at a piece of kernel code. Here is PictureKernel, a very simple kernel that multiplies every pixel value in a picture by 2.0, essentially magnifying the value of all the pixels in the picture. The kernel accepts four arguments: the first is a pointer to the input picture, the second is a pointer to the output picture, the third is the number of pixels in a row, that is, in the x dimension, and the fourth is the number of pixels in the vertical, or y, dimension.

When we determine which pixel a particular thread should process, we have a y index generation and an x index generation. The y index generation produces a row index; remember that the row index selects one of the rows along the y dimension, which is why we use the y indices. We just use our familiar pattern of blockIdx.y times blockDim.y plus threadIdx.y, and this ensures that every possible row has a thread that covers it. Then we have the same pattern for the x dimension. This is pretty much identical to the vector addition case, where every element in the x dimension will have a thread to cover that particular pixel value.

Now that we have generated the appropriate row index and column index, all the threads can use them to do the processing. We first test whether the thread is within the valid range: we test whether the row index is less than m, where m is the number of pixels in the y dimension, so the row index needs to be within that range, and we also test whether the column index is less than n, where n is the number of pixels in the x dimension. Whenever both conditions are satisfied, we know that the thread is within the valid range of the two-dimensional area where there are valid pixels. The thread can then proceed to multiply the input pixel by two and assign the result to the output. Here you see the linearized addressing: the thread uses row multiplied by n, the number of pixels in each row, which is the width, plus the column index, and we use exactly the same expression to write into the output array. This allows every thread to pick the pixel at its assigned row and column position, multiply it by 2, and write it into the output array.
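The following is a minimal sketch of the kernel as just described, assuming the argument order given in this lecture (input pointer, output pointer, n pixels in x, m pixels in y); the exact names on the lecture slide may differ.

```cuda
// Sketch of the picture-scaling kernel described above.
// d_Pin/d_Pout are the input/output pictures; n is the x (width) pixel count, m the y (height) count.
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Row index: selects one of the m rows, generated from the y indices.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    // Column index: selects one of the n columns, generated from the x indices.
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check: only threads that map to a valid pixel do any work.
    if (Row < m && Col < n) {
        // Row-major linearized address: row * width + column.
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
    }
}
```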
Here we show the host code that launches PictureKernel. We assume that in the host code the variable m holds the number of pixels in the y dimension and the variable n holds the number of pixels in the x dimension. In our 62 by 76 example, m will be equal to 62 and n will be equal to 76. Now we are going to launch the kernel, and this is how we make sure that the kernel will be executed by enough threads to cover both the y dimension and the x dimension. We take the n value and apply essentially a ceiling function to n divided by 16, using the familiar expression (n - 1)/16 + 1, for the x dimension, and we do the same thing for the y dimension, (m - 1)/16 + 1. Within each block we will have 16 threads in the x dimension, 16 threads in the y dimension, and one thread in the z dimension. This gives us a two-dimensional grid of two-dimensional blocks to cover the given picture. We then use this configuration to launch the kernel: we call PictureKernel with DimGrid and DimBlock, and we give it the four arguments, the pointer to the input picture, the pointer to the output picture, n, and m. This ensures that the kernel will have enough threads to cover both dimensions, and in the kernel we will be able to test whether each thread should be acting on a pixel, or whether the thread is outside the valid range and should not take any action.
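Here is a minimal sketch of that host-side configuration and launch, assuming the device pointers d_Pin and d_Pout have already been allocated and filled (those steps are not shown in this lecture):

```cuda
// Host-side launch configuration for PictureKernel, as described above.
// m = 62 pixels in y, n = 76 pixels in x; 16 x 16 thread blocks.
int m = 62;
int n = 76;

// Ceiling division: enough blocks in x to cover n columns and in y to cover m rows.
dim3 DimGrid((n - 1) / 16 + 1, (m - 1) / 16 + 1, 1);   // 5 x 4 x 1 blocks
dim3 DimBlock(16, 16, 1);                               // 16 x 16 x 1 threads per block

// Launch with the four arguments: input pointer, output pointer, n (x), m (y).
PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, n, m);
```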
This picture shows a little more analysis of what actually happens at run time. We launched four thread blocks by five thread blocks, so we actually end up launching 20 thread blocks, and there are four different situations as far as the execution of the thread blocks is concerned. The first case is in the upper-left corner, where all the threads have both their y index and their x index in the valid range; all the threads within these three-by-four, that is twelve, thread blocks fully execute and fully process the picture. The second case is on the upper-right side: these threads find their y indices within the valid range, but only some of their x indices are in the valid range and some are not, so some of the threads at the right end of these thread blocks will not execute. In the third situation, all the threads find their x indices in the valid range, but some of the threads find their y index outside the valid range, so these thread blocks will have some of their threads, at the bottom of the block, not processing any pixels. Finally, we have the fourth case, where some of the threads find their x index outside the valid range and some find their y index outside the valid range; in fact, some of the threads in the very corner, the lower-right corner, find both their x and y indices outside the valid range.

All of these situations will happen, and this is an important picture for you to remember when we begin to discuss the performance implications. For now, it suffices to remember that whenever we process a two-dimensional picture and use a boundary check to test whether each thread has its y index and x index within the valid range, we end up with these four possible cases.

This brings us to the conclusion of this lecture. For those of you who would like to understand more about the material, I encourage you to read Sections 4.1 and 4.2 of the textbook. Thank you.